How to improve your nflscrapR graphics

This resource is modeled after the fantastic BBC Graphics Cookbook, which is also worth checking out. The nflscrapR team (Maksim Horowitz, Ron Yurko, and Sam Ventura) have compiled easy to access play-by-play stats opening a deeper world of NFL analytics for reporters, bloggers and enthusiasts (and probably some NFL teams). Ben Baldwin has compiled a quickstart guide to using this data. As such, this resource is not aimed at reproducing that tutorial, but giving you some quick guides for improving the graphics you create via ggplot2. It’s easy to get started quickly exploring the data with ggplot2 and hopefully this helps with your “publication” quality plots.

I am providing a lot of my own opinion on certain dataviz choices - everyone is allowed to make their own decisions with regards to colors, ink use, chart type - but I do hope that this resource opens your eyes to some of the art of dataviz now that you have made progress with the science.

The source code for this webpage is on Github if you want to take a look.

Additional Resources

If you’d rather go deeper into a textbook and ignore specific applications related to nflscrapR, check out these amazing free online resources (some available in print as well):

Title/Link Author Description
R for Data Science Hadley Wickham, Garret Grolemund A great overview of the tidyverse, covers everything from reading data in, data manipulation/summarization, data viz, and general programming in R
SocViz Kieran Hiely Covers exactly HOW to create a lot of different plot types in R/ggplot2
Fundamentals of Data Viz Claus Wilke Covers the WHY of Data Viz where all examples are in R, but no code examples in the book, but are available on his GitHub
BBPlot Cookbook BBC Data Team Intro primer to news-style graphics in ggplot2
ggplot2 cookbook Winston Chang Quick cookbook of ggplot2 plots
R Graph Gallery Yan Holtz Cookbook examples of a majority of plot types.
ggplot2 Book Hadley Wickham, Danielle Navarro This 3rd edition of the ggplot2 book is currently under development, but also available freely online for the first time! A more technical book that should align well with either SocViz or Fundamentals of Data Viz

Useful code chunks

There are a couple features that we will use throughout these examples:

dplyr::if_else()

This allows you to make a binary conversion.

For example if_else(condition, true, false)

  • mutate(success = if_else(epa > 0, 1, 0))
  • mutate(color = if_else(posteam == "PIT", "yellow", "grey))

dplyr::case_when()

This allows you to essentially use many if_else statements at once

  • The ~ indicates an assignment, where if the left side statement is evaluated as TRUE then the outcome is ~ (assigned) to the right side.
    • The right side can be a number, text, etc
    • The left side can be a simple or complex statement, but must evaluate as TRUE/FALSE (logical)
  • The final TRUE ~ NA_character_ is basically a “catch” - if none of the other cases are met, then it will default to NA
    • In this case we use NA_character_ from dplyr, but you could also have a situation where it could simply say “nope” or revert back to some other column
    • If you want to have the right side (assignment) be a number, you’ll need to use NA_integer_
  • Lastly, a longer case_when() is presented shortly below
pbp %>%
  mutate(
    stick_throw = case_when(
      air_yards < ydstogo ~ "Short of Sticks",
      air_yards == ydstogo ~ "At Stick",
      air_yards > ydstogo ~ "Past Stick",
      TRUE ~ NA_character_
    )
  ) %>%
  select(air_yards, ydstogo, stick_throw) %>%
  filter(!is.na(air_yards))
## # A tibble: 17,669 x 3
##    air_yards ydstogo stick_throw    
##        <dbl>   <dbl> <chr>          
##  1         8      15 Short of Sticks
##  2         4      10 Short of Sticks
##  3        -3      10 Short of Sticks
##  4        24      10 Past Stick     
##  5         1       1 At Stick       
##  6         4       8 Short of Sticks
##  7         6       4 Past Stick     
##  8        16      10 Past Stick     
##  9        -9      13 Short of Sticks
## 10         2      10 Short of Sticks
## # … with 17,659 more rows

scale_color_identity()

This is useful in combination with the above example of assigning color in a plot, essentially it will take the “yellow” or “grey” argument automatically.

scale_color_manual()

This allows you to specify colors of interest like scale_color_manual(values = c("red", "black"))

forcats::reorder()

This allows you to reorder levels of a ggplot by another variable.

eg reorder(posteam, epa)

Helpers

There are a few helpers used frequently throughout.

  • ! indicates not or negation, so x != 5 means x not equal to 5.
    • !is.na(x) indicates x is NOT NA
  • %in% means in - so x %in% c(2, 3, 4) means x matches 2, 3 OR 4
  • dplyr::between(x, left, right) - shortcut for x >= left & x <= right
  • hjust/vjust - this is typically assigned 0 through 1, and adjusts either the horizontal or vertical alignment

ggplot2 specs

The documentation for ggplot2 cover in great detail MANY options for minor but important customizations. I’m not adding it directly here but adding as a resource. It is definitely worth parsing through, and some examples below:

  • lines (size, color, type, join, end)
  • points (size, color, fill, stroke)
  • text (size, face)
  • justification (hjust, vjust, nudge_x, nudge_y)

teamcolors package

Gives you ALL the colors for NFL teams

Using teamcolors

filter(teamcolors, league == "nfl")
## # A tibble: 32 x 8
##    name    league primary secondary tertiary quaternary division logo      
##    <chr>   <chr>  <chr>   <chr>     <chr>    <chr>      <chr>    <chr>     
##  1 Arizon… nfl    #97233f #000000   #ffb612  #a5acaf    NFC West http://co…
##  2 Atlant… nfl    #a71930 #000000   #a5acaf  #a30d2d    NFC Sou… http://co…
##  3 Baltim… nfl    #241773 #000000   #9e7c0c  #c60c30    AFC Nor… http://co…
##  4 Buffal… nfl    #00338d #c60c30   #0c2e82  #d50a0a    AFC East http://co…
##  5 Caroli… nfl    #0085ca #000000   #bfc0bf  #0085ca    NFC Sou… http://co…
##  6 Chicag… nfl    #0b162a #c83803   #0b162a  #c83803    NFC Nor… http://co…
##  7 Cincin… nfl    #000000 #fb4f14   #000000  #d32f1e    AFC Nor… http://co…
##  8 Clevel… nfl    #fb4f14 #22150c   #a5acaf  #d32f1e    AFC Nor… http://co…
##  9 Dallas… nfl    #002244 #b0b7bc   #acc0c6  #a5acaf    NFC East http://co…
## 10 Denver… nfl    #002244 #fb4f14   #00234c  #ff5200    AFC West http://co…
## # … with 22 more rows

Please note that teams are listed by full name so to use them with the play-by-play data you will need to “join” the teamcolors and play-by-play datasets together.

The list of short teams named could be accomplished like so:

nfl_colors <- teamcolors %>%
  filter(league == "nfl") %>%
  mutate(
    team_abb = case_when(
      name == "Arizona Cardinals" ~ "ARI",
      name == "Atlanta Falcons" ~ "ATL",
      name == "Baltimore Ravens" ~ "BAL",
      name == "Buffalo Bills" ~ "BUF",
      name == "Carolina Panthers" ~ "CAR",
      name == "Chicago Bears" ~ "CHI",
      name == "Cincinnati Bengals" ~ "CIN",
      name == "Cleveland Browns" ~ "CLE",
      name == "Dallas Cowboys" ~ "DAL",
      name == "Denver Broncos" ~ "DEN",
      name == "Detroit Lions" ~ "DET",
      name == "Green Bay Packers" ~ "GB",
      name == "Houston Texans" ~ "HOU",
      name == "Indianapolis Colts" ~ "IND",
      name == "Jacksonville Jaguars" ~ "JAX",
      name == "Kansas City Chiefs" ~ "KC",
      name == "Los Angeles Rams" ~ "LA",
      name == "Los Angeles Chargers" ~ "LAC",
      name == "Miami Dolphins" ~ "MIA",
      name == "Minnesota Vikings" ~ "MIN",
      name == "New England Patriots" ~ "NE",
      name == "New Orleans Saints" ~ "NO",
      name == "New York Giants" ~ "NYG",
      name == "New York Jets" ~ "NYJ",
      name == "Oakland Raiders" ~ "OAK",
      name == "Philadelphia Eagles" ~ "PHI",
      name == "Pittsburgh Steelers" ~ "PIT",
      name == "Seattle Seahawks" ~ "SEA",
      name == "San Francisco 49ers" ~ "SF",
      name == "Tampa Bay Buccaneers" ~ "TB",
      name == "Tennessee Titans" ~ "TEN",
      name == "Washington Redskins" ~ "WAS",
      TRUE ~ NA_character_
    ),
    posteam = team_abb
  )

You could then use dplyr::left_join() to join the full names, colors, and team logos to the play-by-play data. Without getting into the weeds TOO much, a left_join basically finds cases where there is a matching row in the common column (posteam) for both dataframes, and then adds the additional columns from nfl_colors to the play-by-play data. Joins are a very important concept when trying to combine multiple datasets, and if you want to read more about the various types and their use cases check out the dplyr joins docs.

Quick example below:

# read in data
pbp <- read_csv("https://raw.githubusercontent.com/ryurko/nflscrapR-data/master/play_by_play_data/regular_season/reg_pbp_2018.csv")
# left_join the data together
pbp_colors <- left_join(pbp, nfl_colors, by = c("posteam"))

pbp_colors %>%
  # Excludes non-plays, eg end of quarter
  filter(!is.na(posteam)) %>%
  select(posteam, team_abb, name, primary, secondary, logo) %>%
  # Distinct grabs only the distinct/unique cases of column
  distinct(posteam, .keep_all = TRUE)
## # A tibble: 32 x 6
##    posteam team_abb name       primary secondary logo                      
##    <chr>   <chr>    <chr>      <chr>   <chr>     <chr>                     
##  1 ATL     ATL      Atlanta F… #a71930 #000000   http://content.sportslogo…
##  2 PHI     PHI      Philadelp… #004953 #a5acaf   http://content.sportslogo…
##  3 BAL     BAL      Baltimore… #241773 #000000   http://content.sportslogo…
##  4 BUF     BUF      Buffalo B… #00338d #c60c30   http://content.sportslogo…
##  5 JAX     JAX      Jacksonvi… #000000 #006778   http://content.sportslogo…
##  6 NYG     NYG      New York … #0b2265 #a71930   http://content.sportslogo…
##  7 NO      NO       New Orlea… #9f8958 #000000   http://content.sportslogo…
##  8 TB      TB       Tampa Bay… #d50a0a #34302b   http://content.sportslogo…
##  9 NE      NE       New Engla… #002244 #c60c30   http://content.sportslogo…
## 10 HOU     HOU      Houston T… #03202f #a71930   http://content.sportslogo…
## # … with 22 more rows

So we can see that the posteam and team_abb are equivalent, where the full team name, colors, and logo are also added. I dropped the other 250+ columns for printing here, but they would be in the complete dataframe.

ggsave()

If you are going to export your graphics, it’s worth it to go through ggsave() rather than the RStudio export button.

The full docs have lots of great info but I’ll summarize it here. The basic arguments in pseudocode are below.

ggsave("plot_name.png", plot_object,
       height = x, width = y, units = "in", dpi = "300")

A typical call of ggsave would look like the below.

ggsave("wr_epa.png", wr_epa_plot, 
       height = 6, width = 8, units = "in", dpi = "350")

Arguably, the most important part is the DPI call - if you save through the export button you will typically have a low DPI (72) that has jagged edges on lines, as opposed to exporting with a higher DPI.

You will likely spend some time perfecting the print size of your plots, but if you use your own theme with text sized appropriately you can typically set a specific DPI and work from there.

Changing fonts

Changing fonts for graphics in R can be easy if you use a package like extrafont or showtext. You can then change font family in your theme calls or as part of your personal theme.

extrafont has an example walking through it’s use.

showtext has an example walking through it’s use.

Prep

Load all the libraries you need

There are a few packages I will use in this guide, most of them related to data viz.

library(tidyverse) # Data Cleaning, manipulation, summarization, plotting
library(gt) # beautiful tables
library(DT) # beautiful interactive tables
library(ggthemes) # custom pre-built themes
library(bbplot) # more themes
library(ggtext) # custom text color
library(teamcolors) # NFL team colors and logos
library(ggforce) # better annotations
library(ggridges) # many distributions at once
library(ggrepel) # better labels
library(ggbeeswarm) # beeswarm plots
library(extrafont) # for extra fonts

Read in the pbp data

This is taken almost verbatim from Ben’s Tutorial, but the idea is that you are adjusting the dataset to be ready for analysis. If you are interested in plays beyond pass/rush then you should probably NOT do these steps.

pbp <- read_csv("https://raw.githubusercontent.com/ryurko/nflscrapR-data/master/play_by_play_data/regular_season/reg_pbp_2018.csv")
# clean up the data for further analysis
pbp_rp <- pbp %>%
  # grab only penalties, pass, and run plays
  filter(!is.na(epa), play_type == "no_play" | play_type == "pass" | play_type == "run") %>%
  # create pass, rush and success columns
  mutate(
    pass = if_else(str_detect(desc, "(pass)|(sacked)|(scramble)"), 1, 0),
    rush = if_else(str_detect(desc, "(left end)|(left tackle)|(left guard)|(up the middle)|(right guard)|(right tackle)|(right end)") & pass == 0, 1, 0),
    success = ifelse(epa > 0, 1, 0)
  ) %>%
  # filter to only pass or rush plays
  filter(pass == 1 | rush == 1) %>%
  mutate(
    passer_player_name = ifelse(play_type == "no_play" & pass == 1,
      str_extract(desc, "(?<=\\s)[A-Z][a-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?(?=\\s((pass)|(sack)|(scramble)))"),
      passer_player_name
    ),
    receiver_player_name = ifelse(play_type == "no_play" & str_detect(desc, "pass"),
      str_extract(
        desc,
        "(?<=to\\s)[A-Z][a-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?"
      ),
      receiver_player_name
    ),
    rusher_player_name = ifelse(play_type == "no_play" & rush == 1,
      str_extract(desc, "(?<=\\s)[A-Z][a-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?(?=\\s((left end)|(left tackle)|(left guard)|      (up the middle)|(right guard)|(right tackle)|(right end)))"),
      rusher_player_name
    )
  ) %>%
  mutate(
    name = if_else(!is.na(passer_player_name), passer_player_name, rusher_player_name),
    rusher = rusher_player_name,
    receiver = receiver_player_name,
    play = 1
  )

Our first data summary

This is also credited to Ben:

“Let’s look at which teams were the most pass-heavy in the first half on early downs with win probability between 20 and 80, excluding the final 2 minutes of the half when everyone is pass-happy:”

schotty <- pbp_rp %>%
  filter(wp > .20 & wp < .80 & down <= 2 & qtr <= 2 & half_seconds_remaining > 120) %>%
  group_by(posteam) %>%
  summarize(mean_pass = mean(pass), 
            plays = n()) %>%
  arrange(mean_pass)

schotty
## # A tibble: 32 x 3
##    posteam mean_pass plays
##    <chr>       <dbl> <int>
##  1 SEA         0.369   320
##  2 JAX         0.435   276
##  3 TEN         0.441   263
##  4 BUF         0.452   219
##  5 BAL         0.458   299
##  6 ARI         0.466   236
##  7 NYJ         0.473   256
##  8 DET         0.482   299
##  9 WAS         0.485   239
## 10 CAR         0.491   281
## # … with 22 more rows

“The Seahawks were playing a different sport in 2018. Fun! Let’s see what that looks like:”

ggplot(schotty, aes(x = reorder(posteam,-mean_pass), y = mean_pass)) +
        geom_text(aes(label = posteam))

Now this is a useful plot, but as Ben said: “This image is kind of a mess – we still need a title, axis labels, etc – but gets the point across. We’ll get to that other stuff later.”

Let’s get to that stuff now!

Themes

ggplot2 out of the box comes with a bunch of themes, things like theme_bw(), theme_minimal(), theme_classic(), and the default theme_grey().

Let’s see what they look like with the same plot as above.

theme_bw()

  • Notice that we now have grey gridlines, a black border, and a white background.

theme_minimal()
- Notice that we still have grey gridlines, a white background, but now no black border.

theme_classic()

  • Notice that we now have NO gridlines, a half black border, and the same white background.

But as with almost everything in R, there are more packages that add more functionality! In this case, there are entire packages dedicated to themes in ggplot2 and you have the ability to build your own themes!

More themes

library(ggthemes)
library(bbplot)

The ggthemes package gives you a wide assortment of additional themes as seen here. Most importantly it also gives you ideas about customizations to your personal theme. If you parse through the source code, you can create your own theme and utilize across your visualizations.

theme_fivethirtyeight()

  • The difference from theme_minimal() is ironically, minimal but the main difference is heavier grey gridlines, and a subtle grey background - which aligns with the FiveThirtyEight style.

Again, the exciting part about ggthemes in my mind is the concept of creating your own theme. In fact, the code for this theme is pretty simple!

theme_fivethirtyeight <- function(base_size = 12, base_family = "sans") {
  colors <- deframe(ggthemes::ggthemes_data[["fivethirtyeight"]])
  (theme_foundation(base_size = base_size, base_family = base_family)
  + theme(
      line = element_line(colour = "black"),
      rect = element_rect(
        fill = colors["Light Gray"],
        linetype = 0, colour = NA
      ),
      text = element_text(colour = colors["Dark Gray"]),
      axis.title = element_blank(),
      axis.text = element_text(),
      axis.ticks = element_blank(),
      axis.line = element_blank(),
      legend.background = element_rect(),
      legend.position = "bottom",
      legend.direction = "horizontal",
      legend.box = "vertical",
      panel.grid = element_line(colour = NULL),
      panel.grid.major =
        element_line(colour = colors["Medium Gray"]),
      panel.grid.minor = element_blank(),
      plot.title = element_text(hjust = 0, size = rel(1.5), face = "bold"),
      plot.margin = unit(c(1, 1, 1, 1), "lines"),
      strip.background = element_rect()
    ))
}

Edited Theme

I personally edited this so that it didn’t remove axis titles, and to have a white background instead of gray, which you can see below.

theme_538 <- function(base_size = 12, font = "Lato") {

  # Text setting
  txt <- element_text(size = base_size + 2, colour = "black", face = "plain")
  bold_txt <- element_text(
    size = base_size + 2, colour = "black",
    family = "Montserrat", face = "bold"
  )
  large_txt <- element_text(size = base_size + 4, color = "black", face = "bold")


  theme_minimal(base_size = base_size, base_family = font) +
    theme(
      # Legend Settings
      legend.key = element_blank(),
      legend.background = element_blank(),
      legend.position = "bottom",
      legend.direction = "horizontal",
      legend.box = "vertical",

      # Backgrounds
      strip.background = element_blank(),
      strip.text = large_txt,
      plot.background = element_blank(),
      plot.margin = unit(c(1, 1, 1, 1), "lines"),

      # Axis & Titles
      text = txt,
      axis.text = txt,
      axis.ticks = element_blank(),
      axis.line = element_blank(),
      axis.title = bold_txt,
      plot.title = large_txt,

      # Panel
      panel.grid = element_line(colour = NULL),
      panel.grid.major = element_line(colour = "#D2D2D2"),
      panel.grid.minor = element_blank(),
      panel.background = element_blank(),
      panel.border = element_blank()
    )
}

Now let’s see what the edited theme looks like in action!

ggplot(schotty, aes(x = reorder(posteam, -mean_pass), y = mean_pass)) +
  geom_text(aes(label = posteam)) +
  theme_538()

Regardless - the idea here is that you can:

  • Use a built in theme (theme_bw, theme_minimal, etc)
  • Use a pre-built theme (bbplot, ggthemes, etc)
  • Or build your own theme!

All are valid, but you don’t necessarily have to actually manually code the theme element changes to each and every plot. You can at the least write your own theme as a function and use it. Alternatively, you can write your own package (easier than it sounds!) and source that.

If you would like to read more about customizing your OWN theme - check out the great resource by Simon Jackson at his blog.

Line Charts

Basic line chart = ggplot() + geom_line()

# Prepare data
wr_duel <- pbp_rp %>%
  filter(receiver %in% c("A.Brown", "J.Smith-Schuster")) %>%
  group_by(game_date, receiver) %>%
  summarize(mean_epa = mean(epa, na.rm = TRUE))

ggplot(
  wr_duel,
  aes(x = game_date, y = mean_epa, color = receiver)
) +
  geom_line(size = 1)

Let’s improve this a bit.

wr_duel_plot <- ggplot(
  wr_duel,
  aes(x = game_date, y = mean_epa, color = receiver)
) +
  geom_line(size = 1) +
  theme_538() +
  geom_hline(yintercept = 0, size = 1, color = "black") +
  labs(
    x = "\nGame Date",
    y = "EPA (Average)",
    title = "Quick comparison of AB vs Juju across the 2018 season",
    caption = "Data: @nflscrapR"
  )

wr_duel_plot

Change the colors

But we can still improve this a lot - it feels a bit crowded, plus the red/blue colro scheme doesn’t align with the team’s color or anything else. We can add colored text via the ggtext package, or we can manually change the colors. Also note that you can grab the team’s colors via teamcolors package.

pit_colors <- teamcolors %>% 
  filter(name == "Pittsburgh Steelers") %>%
  select(name:secondary)

pit_colors
## # A tibble: 1 x 4
##   name                league primary secondary
##   <chr>               <chr>  <chr>   <chr>    
## 1 Pittsburgh Steelers nfl    #000000 #ffb612
pit_primary <- pull(pit_colors, primary)
pit_secondary <- pull(pit_colors, secondary)

Asign the colors

wr_duel_plot <- ggplot(wr_duel,
               aes(x = game_date, y = mean_epa, 
                   color = if_else(receiver == "A.Brown", pit_primary, pit_secondary))) +
  geom_line(size = 1) +
  theme_538() +
  geom_hline(yintercept = 0, size = 1, color = "black") +
  labs(x = "",
       y = "EPA (Average)",
       title = "Quick comparison of <span style='color:#000000'>**AB**</span> vs <span style='color:#ffb612'>**Juju**</span> across the 2018 season",
       caption = "Data: @nflscrapR") +
  scale_color_identity() +
  theme(plot.title = element_markdown())

wr_duel_plot

Add the legend back

Alternatively, if you didn’t want to drop a legend, you could approach it this way.

wr_duel_plot <- ggplot(
  wr_duel,
  aes(
    x = game_date, y = mean_epa,
    color = receiver
  )
) +
  geom_line(size = 1) +
  theme_538() +
  geom_hline(yintercept = 0, size = 1, color = "black") +
  labs(
    x = "",
    y = "EPA (Average)",
    title = "Quick comparison of AB vs Juju across the 2018 season",
    caption = "Data: @nflscrapR"
  ) +
  scale_color_manual(values = c(pit_primary, pit_secondary)) +
  theme(
    legend.title = element_blank(),
    legend.position = c(0.2, 0.1)
  )

wr_duel_plot

Or try direct labeling!

wr_duel_plot +
  theme(legend.position = "none") +
  geom_text(data = filter(wr_duel, game_date == "2018-09-09"),
            aes(x = game_date, y = mean_epa, label = receiver),
            hjust = 0, nudge_y = 0.1, size = 4
  ) +
  geom_point(data = filter(wr_duel, game_date == "2018-09-09"), 
             size = 3
  )

More than a linear time series

You could also try out a connected line plot which lets you plot an x and y axis, then assign “time” as a 3rd variable. I find that it helps to add a connecting line, or else you may have trouble following the linear change in time.

# Prepare data
juju_do_it <- pbp_rp %>%
  filter(receiver == "J.Smith-Schuster") %>%
  arrange(desc(game_date)) %>%
  group_by(game_date) %>%
  summarize(
    total_yards = sum(yards_gained, na.rm = TRUE),
    total_airyards = sum(air_yards, na.rm = TRUE)
  ) %>%
  head(5) %>%
  mutate(
    game_num = row_number(),
    game_text = glue::glue("Game {game_num}")
  )

ggplot(juju_do_it, aes(x = total_airyards, y = total_yards, color = game_num)) +
  # geom path follows the order of underlying data
  geom_path(size = 2) +
  geom_point(size = 5) +
  # creates a line for comparison
  geom_abline(intercept = 0, slope = 1, color = "grey", linetype = "dashed") +
  # adds labels to only game 1 and 5
  geom_text(
    data = filter(juju_do_it, game_num %in% c(1, 5)),
    aes(label = game_text),
    hjust = 1, nudge_x = -5
  ) +
  # set scales for 0-axis
  scale_x_continuous(limits = c(0, 140)) +
  scale_y_continuous(limits = c(0, 140)) +
  # change color gradient to start at black and transition to yellow
  scale_color_gradient(low = pit_primary, high = pit_secondary) +
  theme_538() +
  labs(
    x = "\nTotal Air Yards",
    y = "Total Yards\n",
    title = "Even with his highest Air Yardage, Juju struggled in Game 4",
    caption = "Data: @nflscrapR"
  ) +
  theme(legend.position = "none")

Notice that the above plot has a diagonal trend line that runs intersecting at 0 with a slope of 1:

geom_abline(intercept = 0, slope = 1, color = "grey", linetype = "dashed")

This basic call can be applied to lots of different plots to give a reference line, where you can separate plays/players/teams into above/below the line.

Bar Charts

Everyone’s favorite - bar charts! But always remember that bar charts can limit information - we’ll look at distribution plots of various types later, but for now back to the bar.

Basic forms:

  • ggplot(aes(x = category, y = value)) + geom_col()
  • ggplot(aes(x = category, y = value)) + geom_bar(stat = "identity")

Column defaults to identity, essentially the single number is read as the max value. geom_bar() on the other hand has a bit more flexibility if you want to build stacked or segmented bar charts.

rb_trio <- pbp_rp %>%
  filter(
    posteam == "PIT",
    receiver %in% c("J.Conner", "J.Samuels", "S.Ridley") |
      rusher %in% c("J.Conner", "J.Samuels", "S.Ridley"),
    play_type != "no_play"
  ) %>%
  mutate(
    # Assign a single player name for filtering regardless of play_type
    player = if_else(is.na(receiver), rusher, receiver),
    # Add nice labels to play_type
    play_type = factor(play_type, labels = c("Reception", "Rush"))
  ) %>%
  group_by(player, play_type) %>%
  summarize(
    n = n(),
    mean_yards = sum(yards_gained, na.rm = TRUE) / n,
    mean_success = sum(success, na.rm = TRUE) / n
  )

rb_trio_plot <- rb_trio %>%
  ggplot(aes(x = player, y = mean_yards)) +
  geom_col(aes(fill = play_type), position = "dodge")

rb_trio_plot

Something to notice above - we have created a “grouped” bar chart, where the bars are grouped by player and color is assigned to play type. We can split this out into facets as an alternative representation.

Facets

rb_trio_plot <- rb_trio %>%
  ggplot(aes(x = player, y = mean_yards, fill = player, position = "dodge", group = play_type)) +
  geom_col() +
  facet_grid(~play_type)

rb_trio_plot

Now we are adding color by player and separating into small multiples or facets that represent the play type. Any categorical variable could be used in this fashion - you could essentially build the plot 1x and then facet by a factor to generate N versions of that graph all plotted together.

Let’s raise the bar

rb_trio_plot +
  geom_hline(yintercept = 0.03, color = "black", size = 2) +
  theme_538() +
  scale_fill_manual(values = c(pit_primary, pit_secondary, "grey")) +
  labs(
    x = "",
    y = "Avg Yards per Play",
    title = "Conner and Samuels were interchangeable in 2018",
    subtitle = "Ridley is no longer on the team",
    caption = "Data: @nflscrapR"
  ) +
  theme(
    panel.grid.major.x = element_blank(),
    panel.grid.major.y = element_line(color = "white", size = 1),
    panel.ontop = TRUE,
    legend.position = "none"
  )  +
  scale_y_continuous(
    breaks = seq(0, 6, 1)
  )

Or keep it more traditional

rb_trio_plot +
  geom_hline(yintercept = 0, color = "black", size = 2) +
  theme_538() +
  scale_fill_manual(values = c(pit_primary, pit_secondary, "grey")) +
  labs(
    x = "",
    y = "Avg Yards per Play",
    title = "Conner and Samuels were interchangeable in 2018",
    subtitle = "Ridley is no longer on the team",
    caption = "Data: @nflscrapR"
  ) +
  theme(
    panel.grid.major.x = element_blank(),
    legend.position = "none"
  ) +
  scale_y_continuous(
    breaks = seq(0, 6, 1)
  )

Flip the bar

epa_play <- pbp_rp %>% 
  filter(pass == 1) %>% 
  group_by(posteam) %>% 
  summarize(
    n = n(),
    epa_per_db = sum(epa, na.rm = TRUE) / n,
    success_rate = sum(epa) / n
  )

epa_play %>% 
  ggplot(aes(x = posteam, y = epa_per_db)) +
  geom_col()

This could be a useful summary, but there’s a few issues.

  • Teams arranged by alphabetical name, which is not that useful
  • x-axis is hard to read (worse so if you had full team names)

So let’s try rotating the bar plot.

epa_play %>%
  ggplot(aes(x = epa_per_db, y = reorder(posteam, epa_per_db), )) +
  geom_col()

Yikes - that is not what we want! Instead of just swapping the x and y axes, we should have used coord_flip() - this will actually rotate the plot rather than change the structure.

epa_play %>%
  ggplot(aes(x = reorder(posteam, epa_per_db), y = epa_per_db)) +
  geom_col(aes(fill = if_else(epa_per_db >= 0, "green", "red"))) +
  coord_flip() +
  scale_fill_identity()

Now this is more readable, clearly arranged by the strong passing vs weak passing teams, but still could be improved. Namely, red/green is not ideal for color-blindness, and the default red/green are pretty abrasively bright! However, we can still improve the grid lines (don’t need horizontal), add some better labels, and finish out the plot.

epa_play %>%
  ggplot(aes(x = reorder(posteam, epa_per_db), y = epa_per_db)) +
  geom_col(aes(fill = if_else(epa_per_db >= 0, "#2c7bb6", "#d7181c"))) +
  coord_flip() +
  scale_fill_identity() +
  theme_538() +
  theme(panel.grid.major.y = element_blank()) +
  geom_hline(yintercept = 0) +
  scale_y_continuous(breaks = seq(-0.2, 0.3, 0.1)) +
  labs(
    x = "",
    y = "EPA per Dropback",
    title = "The majority of teams had positive EPA/dropback",
    subtitle = "But there are some clear outliers",
    caption = "Data: @nflscrapR"
  )

Bar plot alternatives

There are some alternative reproducible methods for various bar plots on one of my other guides.

How about a lollipop?

Basic form:

  • ggplot(aes(x = category, y = value)) + geom_col(width = 0.2) + geom_point()
epa_play %>%
  ggplot(aes(x = reorder(posteam, epa_per_db), y = epa_per_db)) +
  geom_col(aes(fill = if_else(epa_per_db >= 0, "#2c7bb6", "#d7181c")),
    width = 0.2
  ) +
  geom_point(aes(color = if_else(epa_per_db >= 0, "#2c7bb6", "#d7181c")),
    size = 5
  ) +
  coord_flip() +
  scale_fill_identity(aesthetics = c("fill", "colour")) +
  theme_538() +
  theme(panel.grid.major.y = element_blank()) +
  geom_hline(yintercept = 0) +
  scale_y_continuous(breaks = seq(-0.2, 0.3, 0.1)) +
  labs(
    x = "",
    y = "EPA per Dropback",
    title = "The majority of teams had positive EPA/dropback",
    subtitle = "But there are some clear outliers",
    caption = "Data: @nflscrapR"
  )

Or a direct labeled bar

epa_play %>%
  ggplot(aes(x = reorder(posteam, epa_per_db), y = epa_per_db)) +
  geom_col(aes(fill = if_else(epa_per_db >= 0, "#2c7bb6", "#d7181c"))) +
  geom_text(aes(
    label = posteam,
    color = if_else(epa_per_db >= 0, "#2c7bb6", "#d7181c"),
    hjust = if_else(epa_per_db > 0, -0.1, 1.1)
  )) +
  coord_flip() +
  scale_fill_identity(aesthetics = c("fill", "colour")) +
  theme_538() +
  theme(
    panel.grid.major.y = element_blank(),
    axis.text.y = element_blank()
  ) +
  geom_hline(yintercept = 0) +
  scale_y_continuous(breaks = seq(-0.2, 0.3, 0.1)) +
  labs(
    x = "",
    y = "EPA per Dropback",
    title = "The majority of teams had positive EPA/dropback",
    subtitle = "But there are some clear outliers",
    caption = "Data: @nflscrapR"
  )

Or dropping the bar completely

epa_play %>%
  ggplot(aes(x = reorder(posteam, epa_per_db), y = epa_per_db)) +
  geom_point(aes(color = if_else(epa_per_db >= 0, "#2c7bb6", "#d7181c")),
    size = 3
  ) +
  geom_text(aes(
    label = posteam,
    color = if_else(epa_per_db >= 0, "#2c7bb6", "#d7181c"),
    hjust = if_else(epa_per_db > 0, -0.2, 1.2)
  )) +
  coord_flip() +
  scale_fill_identity(aesthetics = c("fill", "colour")) +
  theme_538() +
  theme(
    panel.grid.major.y = element_blank(),
    axis.text.y = element_blank()
  ) +
  geom_hline(yintercept = 0) +
  scale_y_continuous(breaks = seq(-0.2, 0.3, 0.1)) +
  labs(
    x = "",
    y = "EPA per Dropback",
    title = "The majority of teams had positive EPA/dropback",
    subtitle = "But there are some clear outliers",
    caption = "Data: @nflscrapR"
  )

In this case, the Y-axis is essentially rank - you could also revert back to just doing this as team logos or adding another variable on the y-axis. This plot is ink efficient, but also has a LOT of unused white space as a result. As such, I don’t think it is a “great” plot.

Scatter plots

Back to stealing from Ben - who has done a great job generating interesting scatter plots. Let’s do his cleanup and then some viz. Step 1 cleans up player names and is verbatim copied from his repo.

pbp_players <- pbp_rp %>%
  mutate(
    passer_player_name = ifelse(play_type == "no_play" & pass == 1,
      str_extract(desc, "(?<=\\s)[A-Z][a-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?(?=\\s((pass)|(sack)|(scramble)))"),
      passer_player_name
    ),
    receiver_player_name = ifelse(play_type == "no_play" & str_detect(desc, "pass"),
      str_extract(
        desc,
        "(?<=to\\s)[A-Z][a-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?"
      ),
      receiver_player_name
    ),
    rusher_player_name = ifelse(play_type == "no_play" & rush == 1,
      str_extract(desc, "(?<=\\s)[A-Z][a-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?(?=\\s((left end)|(left tackle)|(left guard)|      (up the middle)|(right guard)|(right tackle)|(right end)))"),
      rusher_player_name
    )
  )

Step 2 generates our summary dataframe with a few plays of interest. ALWAYS remember to add an ungroup() as otherwise the grouped assignment lives on in the dataset.

qbs <- pbp_players %>%
  mutate(
    name = ifelse(!is.na(passer_player_name), passer_player_name, rusher_player_name),
    rusher = rusher_player_name,
    receiver = receiver_player_name,
    play = 1
  ) %>%
  group_by(name, posteam) %>%
  summarize(
    n_dropbacks = sum(pass),
    n_rush = sum(rush),
    n_plays = sum(play),
    epa_per_play = sum(epa) / n_plays,
    success_per_play = sum(success) / n_plays
  ) %>%
  filter(n_dropbacks >= 100) %>% 
  ungroup() # always ungroup if you no longer need the grouping effect

Basic Scatterplot

Basic form:

  • ggplot(aes(x = value, y = other_value)) + geom_point()
qb_success_rate <- qbs %>%
  ggplot(aes(x = success_per_play, y = epa_per_play)) +
  geom_point() +
  labs(x = "Success rate",
       y = "EPA per play",
       caption = "Data from nflscrapR",
       title = "QB success rate and EPA/play",
       subtitle = "2018, min 100 pass attempts, includes all QB's rush and pass plays") +
  theme_bw() +
  theme(axis.title = element_text(size = 12),
        axis.text = element_text(size = 10),
        plot.title = element_text(size = 16, hjust = 0.5),
        plot.subtitle = element_text(size = 14, hjust = 0.5),
        plot.caption = element_text(size = 12))

qb_success_rate

This is a nice plot, minorly scaled back from Ben’s example code. There is a clear linear relationship between succcess rate (EPA > 0) and EPA per Play, which makes sense.

Add reference lines

We could add back in a few of Ben’s code examples to improve it.

qb_success_rate +
  geom_hline(yintercept = mean(qbs$epa_per_play), color = "red", linetype = "dashed") +
  geom_vline(xintercept =  mean(qbs$success_per_play), color = "red", linetype = "dashed")

This adds lines at the averages for each axis to help with comparison.

We could also accomplish this with the code below. In the below example, it is initially more verbose but also gives you a saved data point to work with, and could be useful if for example you wanted to do a group_by summary or a filter, basically anything beyond just a pure mean.

qb_epa_per_play <- qbs %>%
  summarize(mean = mean(epa_per_play)) %>%
  pull(mean)

qb_success_per_play <- qbs %>%
  summarize(mean = mean(success_per_play)) %>%
  pull(mean)

qb_success_rate +
  geom_hline(yintercept = qb_epa_per_play, color = "red", linetype = "dashed") +
  geom_vline(xintercept = qb_success_per_play, color = "red", linetype = "dashed")

Add linear trendline

We could also add a linear trendline to this plot. Either method shown below is valid, but stat_smooth allows for some additional customization.

qb_success_rate +
  stat_smooth(method = "lm", geom = "line", alpha = 0.5, se = FALSE, color = "red", size = 1)

qb_success_rate +
  geom_smooth(method = "lm", se = FALSE, color = "red")

More than 2 Variables

Now Ben has 2x variables assigned as aesthetics in this plot, success rate as X, EPA/play as Y.

He also added a 3rd variable (size) as an aesthetic. Importantly, because we are putting size and color INSIDE aes() we get to use traditional tidyverse evaluation, so we can reference columns directly, like you see with n_plays and posteam.

qbs %>%
  ggplot(
    aes(x = success_per_play, y = epa_per_play)
  ) +
  # Notice that color/size inside aes()
  geom_point(
    aes(
      color = if_else(posteam == "SF", "red", "black"),
      size = n_plays / 60
    ),
    alpha = 0.50
  ) +
  # we need this to assign red/black to the actual color
  scale_color_identity() +
  labs(
    x = "Success rate",
    y = "EPA per play",
    caption = "Data from nflscrapR",
    title = "QB success rate and EPA/play",
    subtitle = "2018, min 100 pass attempts, includes all QB's rush and pass plays"
  ) +
  theme_bw() +
  theme(
    axis.title = element_text(size = 12),
    axis.text = element_text(size = 10),
    plot.title = element_text(size = 16, hjust = 0.5),
    plot.subtitle = element_text(size = 14, hjust = 0.5),
    plot.caption = element_text(size = 12)
  ) +
  theme(legend.position = "none")

Add labels

We can then add nice labels to ALL the players via ggrepel which automatically repels labels so there is minimal to no overlap.

qbs %>%
  ggplot(aes(x = success_per_play, y = epa_per_play)) +
  # Notice that color/size inside aes()
  geom_point(aes(color = if_else(posteam == "SF", "red", "black"), size = n_plays / 60), alpha = 0.50) +
  # we need this to assign red/black to the actual color
  scale_color_identity() +

  # add labels for all players
  geom_text_repel(aes(label = name, color = if_else(posteam == "SF", "red", "black")),
    force = 1, point.padding = 0.1,
    segment.size = 0.2
  ) +
  labs(
    x = "Success rate",
    y = "EPA per play",
    caption = "Data from nflscrapR",
    title = "QB success rate and EPA/play",
    subtitle = "2018, min 100 pass attempts, includes all QB's rush and pass plays"
  ) +
  theme_bw() +
  theme(
    axis.title = element_text(size = 12),
    axis.text = element_text(size = 10),
    plot.title = element_text(size = 16, hjust = 0.5),
    plot.subtitle = element_text(size = 14, hjust = 0.5),
    plot.caption = element_text(size = 12)
  ) +
  theme(legend.position = "none")

Filter labels

But that’s a LOT of names that we aren’t interested in if we want to talk about just the San Francisco QBs.

qbs %>%
  ggplot(aes(x = success_per_play, y = epa_per_play)) +
  # Notice that color/size inside aes()
  geom_point(aes(color = if_else(posteam == "SF", "red", "black"), size = n_plays / 60), alpha = 0.50) +
  # we need this to assign red/black to the actual color
  scale_color_identity() +

  # add labels JUST for SF
  geom_text_repel(
    data = filter(qbs, posteam == "SF"),
    aes(label = name), color = "red",
    force = 1, point.padding = 0.1,
    segment.size = 0.2
  ) +
  labs(
    x = "Success rate",
    y = "EPA per play",
    caption = "Data from nflscrapR",
    title = "QB success rate and EPA/play",
    subtitle = "2018, min 100 pass attempts, includes all QB's rush and pass plays"
  ) +
  theme_bw() +
  theme(
    axis.title = element_text(size = 12),
    axis.text = element_text(size = 10),
    plot.title = element_text(size = 16, hjust = 0.5),
    plot.subtitle = element_text(size = 14, hjust = 0.5),
    plot.caption = element_text(size = 12)
  ) +
  theme(legend.position = "none")

Add annotations

Staying with our San Francisco example, we can also go about this process differently to answer how did Jimmy G. and Nick the Mullet compare? We can add nice annotations via the ggforce package for just the two players of interest.

qbs %>%
  ggplot(aes(x = success_per_play, y = epa_per_play)) +
  # Notice that color/size inside aes()
  geom_point(aes(
    color = if_else(posteam == "SF", "red", "black"),
    size = n_plays / 60
  ),
  alpha = 0.50
  ) +
  # we need this to assign red/black to the actual color
  scale_color_identity() +
  # add labels JUST for Mullens/Garoppolo with ggforce
  geom_mark_hull(
    aes(
      filter = name %in% c("J.Garoppolo", "N.Mullens"),
      description = "Mullens + Garoppolo performed similarly in 2018"
    ),
    color = "red", label.fontface = "bold", label.colour = "red", con.colour = "red"
  ) +
  labs(
    x = "Success rate",
    y = "EPA per play",
    caption = "Data from nflscrapR",
    title = "QB success rate and EPA/play",
    subtitle = "2018, min 100 pass attempts, includes all QB's rush and pass plays"
  ) +
  theme_bw() +
  theme(
    axis.title = element_text(size = 12),
    axis.text = element_text(size = 10),
    plot.title = element_text(size = 16, hjust = 0.5),
    plot.subtitle = element_text(size = 14, hjust = 0.5),
    plot.caption = element_text(size = 12)
  ) +
  theme(legend.position = "none")

Deep Dive on Scatterplots

Let’s take Ben’s other example of effectiveness when targeting RBs in passing plays. A bit deeper on the data cleaning step here as well.

We will join the play by play data with the roster data to slice by position.

# get from tutorial
rosters <- read_csv("https://raw.githubusercontent.com/ryurko/nflscrapR-data/master/roster_data/regular_season/reg_roster_2018.csv") %>%
  filter(position %in% c("WR", "RB", "FB", "TE"), season == 2018) %>% 
  mutate(name = abbr_player_name, posteam = team) %>%
  select(season, name, posteam, position)
## Parsed with column specification:
## cols(
##   season = col_double(),
##   season_type = col_character(),
##   full_player_name = col_character(),
##   abbr_player_name = col_character(),
##   team = col_character(),
##   position = col_character(),
##   gsis_id = col_character()
## )

Join to add position

We can now clean up the data a bit and add the positional data with a left_join(). Ben was also nice enough to share a lot of players that had problematic names, so we can manually assign their position with an if_else.

data_clean <- pbp_rp %>%
  filter(pass == 1 & sack == 0 & qb_scramble == 0) %>%
  select(
    name, pass, desc, posteam, epa, defteam, complete_pass, incomplete_pass,
    air_yards, receiver_player_name, down, success, complete_pass
  ) %>%
  left_join(rosters, by = c("receiver_player_name" = "name", "posteam")) %>%
  mutate(
    qb = ifelse(is.na(position), 0, 1), rec = receiver_player_name,
    drop = if_else(str_detect(desc, "(sacked)|(scramble)"), 1, 0)
  ) %>%
  filter(drop == 0)

problem_wrs <- c(
  "K.Benjamin", "A.Cooper", "G.Tate", "A.Robinson", "B.Marshall",
  "D.Hilliard", "D.Thompson", "De.Thomas", "E.St", "K.Benjamin", "K.Bibbs",
  "Ty.Williams", "W.Snead", "W.Snead IV", "T.Pryor", "E.St. Brown",
  "A.Robinson II", "J.Gordon", "D.Carter", "B.Ellington",
  "A.Holmes", "R.Matthews", "M.Valdes", "V.Bolden"
)

problem_rbs <- c(
  "A.Abdullah", "C.Hyde", "Dam.", "T.Montgomery", "A.Ekeler", "T.Yeldon",
  "Dam. Williams", "Dar.Williams", "R.Jones II", "C.Anderson"
)

# fix a bunch of problem players
pos <- data_clean %>%
  mutate(
    position = if_else(
      rec %in% problem_wrs, "WR", position
    ),
    position = if_else(
      rec %in% problem_rbs, "RB", position
    ),
    position = if_else(position == "FB", "RB", position)
  ) %>%
  filter(!is.na(position), down <= 2)

First try at Air Yards vs EPA

Now that we have the data like we want it - we can make a real quick scatter plot comparing Air Yards to EPA. This is basic, but highlights why we see the hard split in EPA across air yards. Essentially it breaks down to either incomplete (red) or complete passes (blue). This is important to think of down the road. Deeper passes move the needle more in EPA, but are they complete at the same rate?

pos %>%
  ggplot(aes(x = air_yards, y = epa, color = if_else(complete_pass == 1, "blue", "red"))) +
  geom_point() +
  scale_color_identity()
## Warning: Removed 636 rows containing missing values (geom_point).

What about by position?

Let’s start making this more meaningful and try for the comparison of RB vs WR vs TE, also let’s limit it to the more common depths and not limit it to dump passes behind the LOS. So we’ll limit to passes that travel between 1 and 25 yards, and split by position.

pos %>%
  mutate(position = factor(position, levels = c("WR", "RB", "TE"))) %>%
  filter(between(air_yards, 1, 25)) %>%
  ggplot(aes(x = air_yards, y = epa, fill = position)) +
  geom_point() +
  facet_grid(~position)

We see that WRs and TEs clearly get thrown more deep balls, where RB passes are deeply concentrated at 0-5 yards. However due to the colors/opacity we are losing some data clarity.

Change the alpha (transparency)

We add a color by position and make the points mostly transparent so we can see the stacking of points at each depth. However, we don’t need two legends for position and we have more improvements to be made.

pos %>%
  mutate(position = factor(position, levels = c("WR", "RB", "TE"))) %>%
  filter(between(air_yards, 1, 25)) %>%
  ggplot(aes(x = air_yards, y = epa, fill = position)) +
  geom_point(aes(group = air_yards), shape = 21, alpha = 0.2) +
  facet_grid(~position)

Add summary points

We can add median points at each yard by using stat_summary(), this allows us to calculate summary stats and apply as a new layer on top of the existing graph.

pos %>%
  mutate(position = factor(position, levels = c("WR", "TE", "RB"))) %>%
  filter(between(air_yards, 1, 25)) %>%
  ggplot(aes(x = air_yards, y = epa, fill = position)) +
  geom_point(aes(group = air_yards), shape = 21, alpha = 0.2) +
  stat_summary(fun.y = "mean", geom = "point", size = 3, aes(color = position), shape = 21, color = "white", stroke = 1) +
  geom_smooth(color = "white", alpha = 0.5) +
  facet_grid(~position)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

We can see that the WRs and TEs essentially have a linear increase in EPA (even with incompletions) as we go from 0 - 20 air yards, whereas RBs have much higher fluctation after 5 yards due to smaller sample sizes.

Limit the window

Because we essentially are “wasting” so much of the plot to white space for extremely rare situations we can make a judgement call to “zoom” in on the plot with coord_cartesian. Importantly, this doesn’t REMOVE the points, but rather just zooms on the graph. A scale_y_continuous setting of limit would actually remove points and change the fitted curve.

We also:

  • Manually assign the colors to be better versions of the default red, green, blue
  • Add a theme
  • Add titles and context
pos %>%
  mutate(position = factor(position, levels = c("WR", "TE", "RB"))) %>%
  filter(between(air_yards, 1, 25)) %>%
  ggplot(aes(x = air_yards, y = epa, fill = position)) +
  geom_point(aes(group = air_yards), shape = 21, alpha = 0.2) +
  geom_hline(yintercept = 0, size = 1, color = "black") +
  stat_summary(fun.y = "mean", geom = "point", size = 3, aes(color = position), shape = 21, color = "white", stroke = 1) +
  geom_smooth(color = "white", alpha = 0.5) +
  facet_grid(~position) +
  coord_cartesian(ylim = c(-1.5, 5)) +
  scale_y_continuous(breaks = seq(-1.5, 4.5, by = 0.5)) +
  ggthemes::theme_fivethirtyeight() +
  theme(
    legend.position = "none",
    strip.text = element_text(face = "bold")
  ) +
  scale_fill_manual(
    values = c("#00b159", "#003399", "#ff2b4f"),
    aesthetics = c("color", "fill")
  ) +
  labs(
    x = "Air Yards (Depth of Target)",
    y = "EPA\n",
    title = "WR and TE EPA generally increases by depth of target",
    subtitle = "However, RBs generally don't get targeted at these distances!\n\nPasses = 1st/2nd, Air Yards between 1 and 25",
    caption = "Data: @nflscrapR"
  )
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

We can immediately notice a few things - because we still show the points behind the mean we see that the completion rate shifts after about 20 air yards for all positions, which corresponds to a major shift in EPA. We can see that there is a major sweet spot for WRs/TEs at 10-20ish air yards where the EPA is maximized, but beyond that plays can get into the home run or bust category.

We also see that RBs are essentially rarely targeted past 5 yards, and although they have some high peaks these are due to just a few successful plays in a limited sample size.

Overall this plot gives us a nice balance between showing the summary level data AND the distribution behind the plot.

Speaking of distributions - we’ll get deeper in those in the next section!

Let’s make 1 more scatterplot though!

Completion rate by Depth of Target

Keeping with the same idea, we can now add in another variable to our plot.

  • Variable 1 = x axis (air yards)
  • Variable 2 = y axis (completion rate)
  • Variable 3 = color/fill (position)
  • Variable 4 = size (number of passes)

Four variables gives us lots of room to create a rich visualization, but we need to be careful to highlight what size represents. We can change the legend position and title as seen below.

Create summary stats

We can summarize the completion rate by air yards and position through the below script. We’ll limit it to between 1 and 25 air yards again.

pass_comp <- pos %>%
  mutate(position = factor(position, levels = c("WR", "TE", "RB"))) %>%
  filter(between(air_yards, 1, 25)) %>%
  group_by(position, air_yards) %>%
  summarize(
    n = n(),
    comp_rate = sum(complete_pass, na.rm = TRUE) / n,
    epa = mean(epa, na.rm = TRUE)
  )

First plot

We’ll go right ahead and add our nice colors, labels, themes, etc and focus on changing just the legend information.

pass_comp_plot <- pass_comp %>%
  ggplot(aes(x = air_yards, y = comp_rate, fill = position)) +
  geom_point(aes(size = n), shape = 21) +
  geom_smooth(color = "white") +
  geom_hline(yintercept = 0, size = 1, color = "black") +
  geom_vline(xintercept = 20, size = 1, color = "black", linetype = "dashed", alpha = 0.5) +
  geom_hline(yintercept = 0.5, size = 1, color = "black", linetype = "dashed", alpha = 0.5) +
  facet_grid(~position) +
  ggthemes::theme_fivethirtyeight() +
  scale_fill_manual(
    values = c("#00b159", "#003399", "#ff2b4f"),
    aesthetics = c("color", "fill")
  ) +
  scale_y_continuous(labels = scales::percent) +
  labs(
    x = "Air Yards (Depth of Target)",
    y = "EPA\n",
    title = "Completion rate by Depth of Target on 1st/2nd Down",
    subtitle = "Completion rate generally drops below 50% for passes > 20 air yards",
    caption = "Graph: @thomas_mock | Data: @nflscrapR",
    size = "N of Passes"
  ) +
  theme(strip.text = element_text(face = "bold"))

pass_comp_plot
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Move the legend

We can move the legend via legend.position - where the vector is now x, y position from 0 to 1. We can also drop the legend for JUST color and fill, as position is already indicated by the headers of the facets.

pass_comp_plot +
  guides(color = FALSE, fill = FALSE) +
  theme(
    legend.direction = "vertical",
    legend.position = c(0.1, 0.2),
    legend.background = element_blank(),
    legend.title = element_text(face = "bold")
  )
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Alright - enough on scatter plots, let’s look at how to approach the distribution of the data alone.

Distributions

Rather than summarizing data into columns/points we can also display the distribution of the data points. The most common distribution plots are:

  • Histograms
  • Density plots

You could also consider a stacked boxplot + jitter plot as showing the distribution.

Histogram

The basic idea of a histogram is that the data is binned along some range (2 airyards in below example) across the x axis, and all values of x that fall within this count add to the total count for that specific bin.

Basic form:

  • ggplot(aes(x = value)) + geom_histogram()
    • Notice NO y-value as it is generated automatically via the plot

Let’s take a look at KC and SEA, teams with very different approaches to their offenses.

sea_color <- teamcolors %>%
  filter(name == "Seattle Seahawks") %>%
  pull(primary)

kc_color <- teamcolors %>%
  filter(name == "Kansas City Chiefs") %>%
  pull(primary)

pbp_rp %>%
  filter(play_type != "no_play", posteam %in% c("SEA", "KC")) %>%
  group_by(posteam, play_type) %>%
  summarize(n = n()) %>%
  mutate(freq = n / sum(n))
## # A tibble: 4 x 4
## # Groups:   posteam [2]
##   posteam play_type     n  freq
##   <chr>   <chr>     <int> <dbl>
## 1 KC      pass        606 0.619
## 2 KC      run         373 0.381
## 3 SEA     pass        465 0.474
## 4 SEA     run         515 0.526

So KC threw the ball almost 62% of the time, while Sea only threw the ball about 47% of the time!

But what does the distribution of throws look like between KC and SEA?

pbp_rp %>%
  filter(play_type == "pass") %>%
  filter(posteam %in% c("SEA", "KC")) %>%
  ggplot(aes(x = air_yards, fill = posteam)) +
  geom_histogram(binwidth = 2)
## Warning: Removed 79 rows containing non-finite values (stat_bin).

The basic histogram is “fine” but let’s spruce it up a bit! We can add our theme, the team colors, and some better labels.

pbp_rp %>%
  filter(play_type == "pass") %>%
  filter(posteam %in% c("SEA", "KC")) %>%
  ggplot(aes(x = air_yards, fill = posteam)) +
  geom_histogram(binwidth = 2, alpha = 0.9) +
  scale_fill_manual(values = c(kc_color, sea_color)) +
  geom_hline(yintercept = 0, size = 1) +
  theme_538() +
  theme(
    legend.title = element_blank(),
    legend.position = c(0.6, 0.9)
  ) +
  scale_x_continuous(breaks = seq(-10, 60, 10)) +
  labs(
    x = "\nAir Yards",
    y = "Count",
    title = "KC threw more passes at all ranges",
    caption = "Data: @nflscrapR"
  )
## Warning: Removed 79 rows containing non-finite values (stat_bin).

Density Plot

“Computes and draws kernel density estimate, which is a smoothed version of the histogram. This is a useful alternative to the histogram for continuous data that comes from an underlying smooth distribution.” - ggplot2 docs

Basic form:

  • ggplot(aes(x = value)) + geom_density()
    • Again notice NO y-value, it is generated automatically
pbp_rp %>%
  filter(play_type == "pass") %>%
  filter(posteam %in% c("SEA", "KC")) %>%
  ggplot(aes(x = air_yards, fill = posteam)) +
  geom_density(alpha = 0.8) +
  scale_fill_manual(values = c(kc_color, sea_color)) +
  theme_538() +
  theme(
    legend.title = element_blank(),
    legend.position = c(0.6, 0.9)
  ) +
  scale_x_continuous(breaks = seq(-10, 60, 10))
## Warning: Removed 79 rows containing non-finite values (stat_density).

An important point - I try not to focus on the Y axis for either histogram/density plots as we are looking at the distribution itself rather than specific numbers. You can scale out the y-axis in a few ways for density plots, which I’ll demonstrate below.

pbp_rp %>%
  filter(play_type == "pass") %>%
  filter(posteam %in% c("SEA", "KC")) %>%
  ggplot(aes(x = air_yards, y = ..scaled.., fill = posteam)) +
  geom_density(alpha = 0.8) +
  scale_fill_manual(values = c(kc_color, sea_color)) +
  theme_538() +
  theme(
    legend.title = element_blank(),
    legend.position = c(0.6, 0.9)
  ) +
  scale_x_continuous(breaks = seq(-10, 60, 10))
## Warning: Removed 79 rows containing non-finite values (stat_density).

Interestingly though, we see that KC and SEA essentially attacked the field in the same way, BUT SEA threw so many fewer passes which was captured in the histogram.

Ridge plots

A nice addon to density plots is through the ggridges package, which allows for the creation of stacked density and histogram plots.

Basic form:

  • ggplot(aes(x = value, y = category)) + geom_density_ridges()
    • Notice now we assign a group/category to Y axis
pbp_rp %>%
  filter(play_type == "pass") %>%
  filter(posteam %in% c("SEA", "KC")) %>%
  ggplot(aes(x = air_yards, y = posteam, fill = posteam)) +
  geom_density_ridges() +
  scale_fill_manual(values = c(kc_color, sea_color)) +
  theme_538() +
  theme(
    panel.grid.major.y = element_blank(),
    legend.position = "none"
  ) +
  scale_x_continuous(breaks = seq(-10, 60, 10)) +
  labs(
    x = "Air Yards",
    y = "",
    title = "SEA and KC pass to similar depths of the field",
    caption = "Data: @nflscrapR"
  )
## Picking joint bandwidth of 2.53
## Warning: Removed 79 rows containing non-finite values
## (stat_density_ridges).

Boxplots

Boxplots are another way of showing central tendency + range of a distribution, but they can still have their quirks or difficulties in explanations. I typically find that adding a geom_jitter() call on top of the boxplot helps with showing both the distribution and the central tendency/range, but YMMV.

Basic form:

  • ggplot(aes(x = category, y = value)) + geom_boxplot()
pbp_rp %>%
  filter(play_type == "pass") %>%
  filter(posteam %in% c("SEA", "KC")) %>%
  ggplot(aes(x = posteam, y = air_yards, fill = posteam)) +
  geom_boxplot() +
  geom_jitter(width = 0.2, alpha = 0.2) +
  scale_fill_manual(values = c(kc_color, sea_color)) +
  theme_538() +
  theme(legend.position = "none")
## Warning: Removed 79 rows containing non-finite values (stat_boxplot).
## Warning: Removed 79 rows containing missing values (geom_point).

pbp_rp %>%
  filter(play_type != "no_play", posteam %in% c("SEA", "KC")) %>% 
  ggplot(aes(x = play_type, y = epa , fill = play_type)) +
  geom_boxplot() +
  geom_jitter(width = 0.3, alpha = 0.1) +
  scale_fill_manual(values = c(kc_color, sea_color)) +
  theme_538() +
  theme(legend.position = "none") +
  facet_grid(~posteam)

Sina plot

geom_sina() from the ggforce package is an alternative to the above wokflow, and is somewhat similar to a vertical geom_density()

Basic form:

  • ggplot(aes(x = category, y = value)) + geom_sina()
pbp_rp %>%
  filter(play_type != "no_play", posteam %in% c("SEA", "KC")) %>%
  ggplot(aes(x = play_type, y = epa, color = posteam)) +
  geom_sina(alpha = 0.5) +
  scale_fill_manual(values = c(kc_color, sea_color), aesthetics = c("fill", "color")) +
  theme_538() +
  theme(legend.position = "none") +
  facet_grid(~posteam)

Beeswarm

geom_beeswarm() from the ggbeeswarm package is an alternative to the above wokflow, but is almost identical to geom_sina() in its basic form.

Basic form:

  • ggplot(aes(x = category, y = value)) + geom_beeswarm()
library(ggbeeswarm)
pbp_rp %>%
  filter(play_type != "no_play", posteam %in% c("SEA", "KC")) %>%
  ggplot(
    aes(x = play_type, y = epa, color = posteam)
  ) +
  geom_beeswarm(alpha = 0.5) +
  scale_fill_manual(values = c(kc_color, sea_color), aesthetics = c("fill", "color")) +
  theme_538() +
  theme(legend.position = "none") +
  facet_grid(~posteam)

Importantly, although this looks very similar to the geom_sina() plots, you have more options about customizing the “swarming”.

pbp_rp %>%
  filter(play_type != "no_play", posteam %in% c("SEA", "KC")) %>%
  ggplot(
    aes(x = play_type, y = epa, color = posteam)
  ) +
  geom_beeswarm(priority = "random", alpha = 0.5, size = 0.5) +
  scale_fill_manual(values = c(kc_color, sea_color), aesthetics = c("fill", "color")) +
  theme_538() +
  theme(legend.position = "none") +
  facet_grid(~posteam)

Miscellaneous Plots

Dumbbell Plots

Dumbell plots are typically best served comparing two summary numbers within a group.

Basic form:

  • ggplot(aes(x = value, y = category, group = group)) + geom_line() + geom_point() + coord_flip()
rush_v_pass <- pbp_rp %>% 
  filter(play_type != "no_play", penalty == 0) %>% 
  group_by(play_type, posteam) %>% 
  summarize(avg_yds = mean(yards_gained, na.rm = TRUE)) %>% 
  ungroup()

nfl_rvp <- pbp_rp %>% 
  filter(play_type != "no_play") %>% 
  group_by(play_type) %>% 
  summarize(avg_yds = mean(yards_gained, na.rm = TRUE)) %>% 
  ungroup() %>% 
  mutate(posteam = "NFL")

rush_v_pass <- bind_rows(rush_v_pass, nfl_rvp) %>% 
  mutate(play_type = factor(play_type,
                            levels = c("pass", "run"),
                            labels = c("Pass", "Rush")))

rush_v_pass %>% 
  ggplot(aes(x = fct_rev(fct_reorder2(posteam, desc(play_type), avg_yds)), y = avg_yds, color = play_type)) +
  geom_line(aes(group = posteam), color = "grey", size = 3) +
  geom_point(size = 5) +
  coord_flip()

Adding some additional aesthetic changes to improve the graph:

  • Color of the points
  • Titles/Axes/Captions
  • Direct label legend
rush_v_pass %>%
  ggplot(aes(x = fct_rev(fct_reorder2(posteam, desc(play_type), avg_yds)), y = avg_yds, color = play_type)) +
  geom_line(aes(group = posteam), color = "grey", size = 3) +
  geom_point(size = 5) +
  geom_text(
    data = filter(rush_v_pass, posteam == "KC" & play_type == "Pass"),
    aes(label = play_type),
    hjust = 0, nudge_y = 0.2, fontface = "bold", size = 6
  ) +
  geom_text(
    data = filter(rush_v_pass, posteam == "KC" & play_type == "Rush"),
    aes(label = play_type),
    hjust = 1, nudge_y = -0.2, fontface = "bold", size = 6
  ) +
  coord_flip() +
  scale_color_manual(values = c("#003399", "#ff2b4f")) +
  theme_538() +
  theme(
    panel.grid.major.y = element_blank(),
    legend.position = "none",
    axis.text.y = element_text(color = if_else(rush_v_pass$posteam == "NFL", "red", "black"))
  ) +
  labs(
    x = "",
    y = "\n Average Yards Gained",
    title = "Passing yards per play outperforms Rushing for all teams",
    caption = "Data: @nflscrapR"
  ) +
  scale_y_continuous(
    limits = c(3, 9),
    breaks = seq(3, 8, 1)
  )

Slope Chart

A slope chart allows you to show the change/trend between two points, most appropriately as a two point time series.

The basic form:

  • ggplot(aes(x = time, y = value, group = group)) + geom_line() + geom_point()

This demo takes a bit of prep, you could skip this simply by using game_date instead of game_week, but I think it’s a useful code-chunk for you to have in the boolbox.

  • case_when() is essentially a switch or a combination of if_else() statements.

We can use it to have a bunch of arguments where you can match some argument and then output something specific. Here we are checking if the game_date is between the two dates for that week’s games and then assigning game_week of the season.

game_num <- pbp_rp %>% 
  mutate(game_week = case_when(
    between(game_date, as.Date("2018-09-05"), as.Date("2018-09-11")) ~ 1,
    between(game_date, as.Date("2018-09-12"), as.Date("2018-09-18")) ~ 2,
    between(game_date, as.Date("2018-09-19"), as.Date("2018-09-25")) ~ 3,
    between(game_date, as.Date("2018-09-26"), as.Date("2018-10-02")) ~ 4,
    between(game_date, as.Date("2018-10-03"), as.Date("2018-10-09")) ~ 5,
    between(game_date, as.Date("2018-10-10"), as.Date("2018-10-16")) ~ 6,
    between(game_date, as.Date("2018-10-17"), as.Date("2018-10-23")) ~ 7,
    between(game_date, as.Date("2018-10-24"), as.Date("2018-10-30")) ~ 8,
    between(game_date, as.Date("2018-10-31"), as.Date("2018-11-06")) ~ 9,
    between(game_date, as.Date("2018-11-07"), as.Date("2018-11-13")) ~ 10,
    between(game_date, as.Date("2018-11-14"), as.Date("2018-11-20")) ~ 11,
    between(game_date, as.Date("2018-11-21"), as.Date("2018-11-27")) ~ 12,
    between(game_date, as.Date("2018-11-28"), as.Date("2018-12-04")) ~ 13,
    between(game_date, as.Date("2018-12-05"), as.Date("2018-12-11")) ~ 14,
    between(game_date, as.Date("2018-12-12"), as.Date("2018-12-18")) ~ 15,
    between(game_date, as.Date("2018-12-19"), as.Date("2018-12-25")) ~ 16,
    between(game_date, as.Date("2018-12-30"), as.Date("2019-01-01")) ~ 17,
    TRUE ~ 99
    )
  ) %>% 
  filter(game_week != 99)

Now we can clean up some of the factors for better printing and limit to KC, the most offensively efficient team in 2018. We’ll be looking at 1st Half vs 2nd Half Runs & Passes.

wk_rvp <- game_num %>%
  filter(play_type != "no_play", game_half %in% c("Half1", "Half2")) %>%
  mutate(game_half = if_else(game_half == "Half1", "1st Half", "2nd Half")) %>%
  group_by(posteam, game_half, game_week, play_type) %>%
  count() %>% 
  ungroup()

kc_rvp <- wk_rvp %>% 
  filter(posteam == "KC") %>% 
  mutate(game_num = if_else(game_week <=11, game_week, game_week - 1),
         play_type = if_else(play_type == "run", "Rush", "Pass"),
         game_text = glue::glue("Game {game_num}")
         )

kc_rvp
## # A tibble: 64 x 7
##    posteam game_half game_week play_type     n game_num game_text
##    <chr>   <chr>         <dbl> <chr>     <int>    <dbl> <glue>   
##  1 KC      1st Half          1 Pass         15        1 Game 1   
##  2 KC      1st Half          1 Rush         11        1 Game 1   
##  3 KC      1st Half          2 Pass         11        2 Game 2   
##  4 KC      1st Half          2 Rush         10        2 Game 2   
##  5 KC      1st Half          3 Pass         27        3 Game 3   
##  6 KC      1st Half          3 Rush         14        3 Game 3   
##  7 KC      1st Half          4 Pass         16        4 Game 4   
##  8 KC      1st Half          4 Rush         14        4 Game 4   
##  9 KC      1st Half          5 Pass         25        5 Game 5   
## 10 KC      1st Half          5 Rush         11        5 Game 5   
## # … with 54 more rows

Notice that we have nice text for game_half and play_type and a game_num variable. Let’s build the basic slope chart. You need SOME type of grouping variable as your 3rd variable eg (var1 = x, var2 = y, var3 = group).

kc_rvp %>% 
  ggplot(aes(x = game_half, y = n, group = game_num)) +
  geom_point() +
  geom_line() +
  facet_grid(~play_type)

This is interesting and shows the key feature of how does the trend of 1st Half vs 2nd Half Rush vs Pass look like.

However, we aren’t sure which games match to fewer first-half rushes vs more second-half rushes We need to assign a color variable in our aes() call. But first, let’s figure out the games where KC rushed more in the 1st half (aka ESTABLISH THE RUN) vs the 2nd half (REAP THE REWARDS OF ESTABLISHMENT).

kc_runs <- kc_rvp %>% 
  filter(play_type == "Rush") %>% 
  spread(game_half, n) %>%
  mutate(balance = if_else(`1st Half` >= `2nd Half`, "Ran More in 1st", "Ran More in 2nd")) %>% 
  gather(key = "game_half", value = "n", `1st Half`:`2nd Half`) %>% 
  select(posteam, game_num,game_half, balance)

kc_runs 
## # A tibble: 32 x 4
##    posteam game_num game_half balance        
##    <chr>      <dbl> <chr>     <chr>          
##  1 KC             1 1st Half  Ran More in 2nd
##  2 KC             2 1st Half  Ran More in 2nd
##  3 KC             3 1st Half  Ran More in 1st
##  4 KC             4 1st Half  Ran More in 1st
##  5 KC             5 1st Half  Ran More in 2nd
##  6 KC             6 1st Half  Ran More in 2nd
##  7 KC             7 1st Half  Ran More in 2nd
##  8 KC             8 1st Half  Ran More in 2nd
##  9 KC             9 1st Half  Ran More in 2nd
## 10 KC            10 1st Half  Ran More in 2nd
## # … with 22 more rows
kc_runs %>% 
  filter(balance == "Ran More in 1st") %>% 
  distinct(game_num)
## # A tibble: 4 x 1
##   game_num
##      <dbl>
## 1        3
## 2        4
## 3       11
## 4       15

So games 3, 4, 11 and 15 were the only games were they ran more or equal amounts in the 1st Half than the 2nd Half. Fun fact - KC went 12-4 last season, and went 2-2 in the games where they ran equally or more in the 1st Half than the 2nd Half.

kc_rvp %>%
  ggplot(
    aes(
      x = game_half, y = n, group = game_week, 
      color = if_else(game_num %in% c(3, 4, 11, 15), "red", "blue")
    )
  ) +
  geom_point() +
  geom_line() +
  geom_text_repel(
    data = filter(
      kc_rvp, game_num %in% c(3, 4, 11, 15),
      game_half == "2nd Half"
    ),
    aes(label = game_num)
  ) +
  facet_grid(~play_type) +
  scale_color_identity()

Ok now we can see that in 3/4 of the ONLY games where they “established” the run in the 1st half they ended up passing dramatically more in the 2nd half (4, 11, 15). Let’s add some more context and details.

kc_rvp %>%
  ggplot(
    aes(
      x = game_half, y = n, group = game_week, 
      color = if_else(game_num %in% c(3, 4, 11, 15), "#ff2b4f", "#003399")
    )
  ) +
  geom_point() +
  geom_vline(xintercept = c(1, 2), size = 2, color = "black", alpha = 0.5) +
  geom_line(size = 2) +
  geom_point(size = 5) +
  geom_text_repel(
    data = filter(
      kc_rvp, game_num %in% c(3, 4, 11, 15),
      game_half == "2nd Half"
    ),
    aes(label = game_text),
    direction = "y", nudge_x = 0.1, segment.size = 0.1, hjust = 0,
    size = 5, fontface = "bold"
  ) +
  facet_grid(~play_type) +
  scale_color_identity() +
  theme_538() +
  theme(panel.grid.major.x = element_blank()) +
  labs(x = "", y = "N of Plays\n",
       title = "In 3 of 4 games where KC established the run they ended up throwing more in the 2nd half",
       subtitle = "They went 2-2 in these games, and 10-2 in their other games",
       caption = "Data: @nflscrapR")

The big players here are manually changing colors and adding filtered data to add text labels for only the points of interest.

That’s all for this section - on to Tables!

Tables

You can create beautiful static and interactive tables in R through the gt and DT packages respectively!

gt

The gt package is essentially a grammar of tables, allowing you to quickly build out tables and output to RTF, HTML, or LaTeX.

Let’s do a quick analysis!

Basic Table

Let’s go back to our schotty example!

schotty
## # A tibble: 32 x 3
##    posteam mean_pass plays
##    <chr>       <dbl> <int>
##  1 SEA         0.369   320
##  2 JAX         0.435   276
##  3 TEN         0.441   263
##  4 BUF         0.452   219
##  5 BAL         0.458   299
##  6 ARI         0.466   236
##  7 NYJ         0.473   256
##  8 DET         0.482   299
##  9 WAS         0.485   239
## 10 CAR         0.491   281
## # … with 22 more rows

We can quickly convert this to a table!

schotty %>%
  slice(1:5, 28:32) %>%
  gt()
posteam mean_pass plays
SEA 0.3687500 320
JAX 0.4347826 276
TEN 0.4410646 263
BUF 0.4520548 219
BAL 0.4581940 299
TB 0.5847176 301
PHI 0.5855263 304
GB 0.5939850 266
KC 0.6342412 257
PIT 0.6634304 309

And then we can make some changes!

schotty_gt <- schotty %>%
  slice(1:5, 28:32) %>%
  arrange(desc(mean_pass)) %>%
  mutate(play_focus = if_else(mean_pass >= .50, "Pass Heavy", "Run Heavy")) %>%
  group_by(play_focus) %>%
  gt()

schotty_gt
posteam mean_pass plays
Pass Heavy
PIT 0.6634304 309
KC 0.6342412 257
GB 0.5939850 266
PHI 0.5855263 304
TB 0.5847176 301
Run Heavy
BAL 0.4581940 299
BUF 0.4520548 219
TEN 0.4410646 263
JAX 0.4347826 276
SEA 0.3687500 320

Fancier Table

schotty_gt %>%
  fmt_percent(columns = vars(mean_pass), decimals = 1) %>%
  tab_header(
    title = "Percentage of Passes by teams on 1st/2nd Down in 1st Half",
    subtitle = "Win Prob between 20 & 80, excludes final 2 minutes of the half"
  ) %>%
  cols_label(
    posteam = "Player",
    mean_pass = "Pass %",
    plays = "Plays"
  ) %>%
  cols_align(
    align = "center"
  ) %>%
  tab_source_note(
    source_note = "Table: @thomas_mock | Data: @nflscrapR"
  )
Percentage of Passes by teams on 1st/2nd Down in 1st Half
Win Prob between 20 & 80, excludes final 2 minutes of the half
Player Pass % Plays
Pass Heavy
PIT 66.3% 309
KC 63.4% 257
GB 59.4% 266
PHI 58.6% 304
TB 58.5% 301
Run Heavy
BAL 45.8% 299
BUF 45.2% 219
TEN 44.1% 263
JAX 43.5% 276
SEA 36.9% 320
Table: @thomas_mock | Data: @nflscrapR

More customization

For this example, we’ll grab just some specific players:
* Primarily Slot Receivers
* Stud RBs
* Stud TEs

And compare their performance when catching the ball on 3rd down, with a few specific criteria.

# 2018 and pass plays
pass_2018 <- pbp_rp %>%
  filter(play_type == "pass", penalty == 0, sack == 0, qb_scramble == 0)

third_down_passes <- pass_2018 %>%
  filter(down == 3, ydstogo <= 10) %>%
  group_by(receiver_player_name) %>%
  mutate(converted = if_else(yards_gained > ydstogo, 1, 0)) %>%
  select(receiver_player_name, yards_gained, ydstogo, epa, converted) %>%
  summarise(
    mean_epa = mean(epa, na.rm = TRUE),
    mean_yardage = mean(yards_gained, na.rm = TRUE),
    mean_ydstogo = mean(ydstogo, na.rm = TRUE),
    n = n(),
    conv_rate = sum(converted) / n
  ) %>%
  ungroup() %>%
  arrange(desc(conv_rate))

rbs <- c(
  "A.Kamara", "J.White", "J.Conner", "C.McCaffrey", "S.Barkley", "E.Elliott",
  "J.Mixon", "T.Gurley", "D.Johnson", "M.Gordon"
)

wrs <- c(
  "D.Westbrook", "A.Humphries", "C.Kupp", "G.Tate", "D.Pettis", "J.Edelman",
  "C.Kupp", "W.Snead IV", "M.Sanu", "T.Lockett", "T.Gabriel", "S.Shepard", "C.Beasley"
)

tes <- c("T.Kelce", "Z.Ertz", "G. Kittle", "E.Engram", "J.Cook", "E.Ebron")

top_players <- c(rbs, wrs, tes)

Now that we have the dataframe setup, we can create a quick table.

third_conv_table <- third_down_passes %>% 
  filter(n >= 10) %>% 
  mutate(position = case_when(
    receiver_player_name %in% rbs ~ "RB",
    receiver_player_name %in% wrs ~ "WR",
    receiver_player_name %in% tes ~ "TE",
    TRUE ~ NA_character_
  ),
  position = factor(position, levels = c("RB", "WR", "TE"))
  ) %>%
  filter(receiver_player_name %in% top_players) %>% 
  select(receiver_player_name, conv_rate, n,  everything(), -mean_epa) %>% 
  group_by(position) %>% 
  arrange(desc(conv_rate)) %>% 
  ungroup() %>% 
  gt::gt(groupname_col = "position") 

third_conv_table
receiver_player_name conv_rate n mean_yardage mean_ydstogo
WR
C.Kupp 0.6666667 12 14.500000 5.750000
W.Snead IV 0.6470588 17 8.294118 5.882353
M.Sanu 0.6363636 22 7.818182 5.590909
T.Lockett 0.6315789 19 15.052632 5.736842
T.Gabriel 0.6250000 16 8.812500 6.250000
D.Westbrook 0.6206897 29 7.482759 5.620690
A.Humphries 0.5500000 20 5.750000 5.550000
C.Beasley 0.5384615 26 8.153846 6.038462
J.Edelman 0.4000000 20 5.550000 4.850000
D.Pettis 0.3846154 13 11.384615 6.769231
G.Tate 0.3750000 32 6.312500 5.531250
S.Shepard 0.3571429 28 5.892857 5.607143
TE
T.Kelce 0.6296296 27 10.111111 5.703704
E.Ebron 0.5357143 28 8.785714 5.821429
J.Cook 0.5185185 27 8.555556 5.851852
E.Engram 0.4705882 17 11.058824 5.352941
Z.Ertz 0.4000000 35 5.571429 5.742857
RB
C.McCaffrey 0.5263158 19 8.684211 5.052632
J.White 0.4285714 28 5.250000 5.821429
S.Barkley 0.4000000 20 6.500000 4.600000
A.Kamara 0.3913043 23 6.043478 6.173913
D.Johnson 0.3636364 33 4.757576 5.363636
E.Elliott 0.3333333 15 3.533333 5.333333

And then really amp it up with further customizations!

third_conv_table %>% 
  tab_header(
    title = "3rd Down Conversion Rates (Slot WR vs RB vs TE)",
    subtitle = "Yds to go <= 10, N of Plays >= 10"
  ) %>% 
  fmt_percent(.,
              columns = vars(conv_rate),
              decimals = 1
  ) %>% 
  fmt_number(
    columns = vars(mean_yardage, mean_ydstogo),
    decimals = 1
  ) %>% 
  cols_label(
    receiver_player_name = "Player",
    mean_yardage = "Yds Gained",
    mean_ydstogo = "Yds to Go",
    n = "Plays",
    conv_rate = "Conversion Rate"
  ) %>% 
  cols_align(
    align = "center"
  ) %>% 
  tab_source_note(
    source_note = "Table: @thomas_mock | Data: @nflscrapR"
  ) %>% 
  tab_footnote(
    footnote = "Average Yards",
    locations = cells_column_labels(
      columns = vars(mean_yardage, mean_ydstogo)
    )
  )
3rd Down Conversion Rates (Slot WR vs RB vs TE)
Yds to go <= 10, N of Plays >= 10
Player Conversion Rate Plays Yds Gained1 Yds to Go1
WR
C.Kupp 66.7% 12 14.5 5.8
W.Snead IV 64.7% 17 8.3 5.9
M.Sanu 63.6% 22 7.8 5.6
T.Lockett 63.2% 19 15.1 5.7
T.Gabriel 62.5% 16 8.8 6.2
D.Westbrook 62.1% 29 7.5 5.6
A.Humphries 55.0% 20 5.8 5.5
C.Beasley 53.8% 26 8.2 6.0
J.Edelman 40.0% 20 5.5 4.8
D.Pettis 38.5% 13 11.4 6.8
G.Tate 37.5% 32 6.3 5.5
S.Shepard 35.7% 28 5.9 5.6
TE
T.Kelce 63.0% 27 10.1 5.7
E.Ebron 53.6% 28 8.8 5.8
J.Cook 51.9% 27 8.6 5.9
E.Engram 47.1% 17 11.1 5.4
Z.Ertz 40.0% 35 5.6 5.7
RB
C.McCaffrey 52.6% 19 8.7 5.1
J.White 42.9% 28 5.2 5.8
S.Barkley 40.0% 20 6.5 4.6
A.Kamara 39.1% 23 6.0 6.2
D.Johnson 36.4% 33 4.8 5.4
E.Elliott 33.3% 15 3.5 5.3
Table: @thomas_mock | Data: @nflscrapR

1 Average Yards

The end

Thanks again for looking through this and hopefully this is helpful, if you have any suggestions - feel free to reference the GitHub repo and share additional examples!

License

This work, “Tom’s Cookbook for Better Viz”, is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.